
OCPEDGE-1036: fix: latency tuning for the rt-kernel tests on AWS metal#30790

Open
jeff-roche wants to merge 1 commit into openshift:main from jeff-roche:rt-test-latencies

Conversation

@jeff-roche
Contributor

@jeff-roche jeff-roche commented Feb 17, 2026

Summary

Replaces the binary pass/fail latency detection with a smarter three-tier analysis that distinguishes real RT kernel issues from environmental noise (e.g., isolated single-CPU spikes on AWS metal instances).

Changes

Two-tier soft/hard thresholds:

  • Soft threshold: expected max latency — exceeding triggers a warning but the test still passes
  • Hard threshold: absolute max latency — exceeding triggers a test failure

Statistical percentage-based detection:

  • If >10% of CPUs exceed the soft threshold, the test fails as a systemic latency issue
  • Isolated spikes on a single CPU (e.g., 1 out of 80) produce a warning, not a failure

Structured JSON diagnostic artifacts:

  • Each latency test now writes an _analysis.json artifact with: max, avg, P99 latency, per-CPU breakdown, soft/hard threshold counts, and overall result (PASS/WARN/FAIL)
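The soft/hard/percentage rules above can be sketched in Go as follows. This is an illustrative stand-in, not the actual helpers in test/extended/kernel/tools.go; the function and type names are assumptions.

```go
package main

import "fmt"

type result string

const (
	resultPass result = "PASS"
	resultWarn result = "WARN"
	resultFail result = "FAIL"
)

// classify applies the soft/hard thresholds and the systemic-percentage
// rule to a set of per-CPU max latencies (in microseconds).
func classify(perCPUMax []float64, soft, hard, maxSoftPct float64) result {
	overSoft := 0
	for _, lat := range perCPUMax {
		if lat > hard {
			return resultFail // any single CPU over the hard cap fails outright
		}
		if lat > soft {
			overSoft++
		}
	}
	if overSoft == 0 {
		return resultPass
	}
	pct := 100 * float64(overSoft) / float64(len(perCPUMax))
	if pct > maxSoftPct {
		return resultFail // systemic: too many CPUs over the soft threshold
	}
	return resultWarn // isolated spike(s): warn but still pass
}

func main() {
	// 1 of 80 CPUs at 211us with soft=150, hard=500, cutoff=10% -> WARN
	lats := make([]float64, 80)
	for i := range lats {
		lats[i] = 40
	}
	lats[0] = 211
	fmt.Println(classify(lats, 150, 500, 10))
}
```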

Thresholds

| Test | Metal Soft | Metal Hard | Non-Metal Soft | Non-Metal Hard |
|---|---|---|---|---|
| oslat | 150us | 500us | 7500us | 10000us |
| cyclictest | 150us | 500us | 7500us | 10000us |
| hwlatdetect | 100us | 500us | 7500us | 10000us |
| deadline_test | 100us | 500us | 7500us | 10000us |
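As a sketch, the per-test defaults in the table above could live in a map keyed by test name. The PR describes a map called rtTestThresholds; the exact struct and field names here are assumptions, and the non-metal relaxation mirrors the table's non-metal columns.

```go
package main

import "fmt"

// Illustrative threshold table; values follow the PR description's
// metal column, but the struct layout is an assumption.
type latencyThresholds struct {
	SoftUs float64 // warn above this
	HardUs float64 // fail above this
}

var rtTestThresholds = map[string]latencyThresholds{
	"oslat":         {SoftUs: 150, HardUs: 500},
	"cyclictest":    {SoftUs: 150, HardUs: 500},
	"hwlatdetect":   {SoftUs: 100, HardUs: 500},
	"deadline_test": {SoftUs: 100, HardUs: 500},
}

// relaxForNonMetal pads every test's thresholds when any non-metal
// worker node is detected, per the non-metal columns of the table.
func relaxForNonMetal() {
	for name := range rtTestThresholds {
		rtTestThresholds[name] = latencyThresholds{SoftUs: 7500, HardUs: 10000}
	}
}

func main() {
	fmt.Println(rtTestThresholds["oslat"].SoftUs) // metal default
	relaxForNonMetal()
	fmt.Println(rtTestThresholds["oslat"].SoftUs) // relaxed for non-metal
}
```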

Code cleanup

  • Unified parseOslatResults and parseCyclictestResults into a single parseLatencyResults function
  • Added unit tests for the new parsing logic

Expected behavior with real job data

| Scenario | Old Behavior | New Behavior |
|---|---|---|
| 1/80 CPUs at 211us | FAIL | WARN (pass) |
| 1/91 CPUs at 241us | FAIL | WARN (pass) |
| 10/80 CPUs at 300us | FAIL | FAIL (systemic) |
| 1/80 CPUs at 600us | FAIL | FAIL (hard threshold) |
| All CPUs under 100us | PASS | PASS |

Summary by CodeRabbit

Release Notes

  • New Features

    • Per-test latency thresholding for real-time performance tests.
    • Enhanced latency analysis with additional metrics (max, average, P99, CPU overruns).
    • Structured per-test artifact logging with timestamps for better result tracking.
  • Improvements

    • Improved error reporting with contextual information for test failures.
  • Tests

    • Added comprehensive unit tests for latency result parsing and threshold handling.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all jobs from the second stage, use the /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci-robot added the jira/valid-reference label Feb 17, 2026
@openshift-ci-robot

openshift-ci-robot commented Feb 17, 2026

@jeff-roche: This pull request references OCPEDGE-1036 which is a valid jira issue.

Details

In response to this:

  • 100 microsecond latency cap for metal instances
  • 7500 microsecond latency cap for non-metal instance types (previous default)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jeff-roche
Contributor Author

/test ?

@openshift-ci openshift-ci bot requested review from deads2k and sjenning February 17, 2026 00:17
@openshift-ci bot added the approved label Feb 17, 2026
@jeff-roche
Contributor Author

/test e2e-gcp-ovn-rt-upgrade

@jeff-roche
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Feb 17, 2026

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1e7a2f80-0b96-11f1-9252-e810dd3e02ff-0

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@jeff-roche
Contributor Author

jeff-roche commented Feb 17, 2026

The payload job fails to upgrade, but the RT tests themselves pass. Addressing the upgrade failures in #30608.

@qJkee
Contributor

qJkee commented Feb 17, 2026

/lgtm

@openshift-ci bot added the lgtm label Feb 17, 2026
@jeff-roche
Contributor Author

/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

@openshift-ci
Contributor

openshift-ci bot commented Feb 17, 2026

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/61996750-0c26-11f1-8539-63c794c57c62-0

@jeff-roche
Contributor Author

/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

@openshift-ci
Contributor

openshift-ci bot commented Feb 24, 2026

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9cdf16c0-11a8-11f1-9f71-c9b6ff3ae133-0

@jeff-roche
Contributor Author

/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

@openshift-ci
Contributor

openshift-ci bot commented Feb 25, 2026

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4d5afc30-1253-11f1-9df8-fe79311f410f-0

@openshift-ci bot removed the lgtm label Feb 26, 2026
@jeff-roche
Contributor Author

/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

@coderabbitai

coderabbitai bot commented Feb 26, 2026

Walkthrough

getRealTimeWorkerNodes now returns a slice of node names and detects non-metal nodes (which forces all real-time thresholds to 7500µs). Real-time test runners were refactored to use per-test thresholds, capture outputs, produce structured latency analyses, and write timestamped per-test artifacts.

Changes

Cohort / File(s) Summary
Real-time Worker Node Collection
test/extended/kernel/common.go
Signature changed to return []string. Collects node names into a slice, detects instance-type labels containing "metal", logs when non-metal nodes exist, and updates rtTestThresholds to softer thresholds when any non-metal node is found.
Latency Thresholds & Test Runners
test/extended/kernel/tools.go
Introduced per-test threshold types and rtTestThresholds map. Refactored runners (runDeadlineTest, runHwlatdetect, runOslat, runCyclictest, runPiStressFifo, runPiStressRR) to use test-specific names/thresholds, capture command output, parse via parseLatencyResults, generate rtLatencyAnalysis, and write timestamped log/json artifacts.
Unit Tests for Parsing Logic
test/extended/kernel/tools_test.go
Added comprehensive unit tests for parseLatencyResults, covering pass/warn/fail scenarios, statistics (max/avg/p99), malformed/empty input handling, and threshold boundary cases.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Runner as "Test Runner"
    participant Tool as "External Test Binary"
    participant Parser as "parseLatencyResults"
    participant Thresholds as "rtTestThresholds"
    participant Artifact as "writeAnalysisArtifact / writeTestArtifacts"
    participant Logger as "e2e.Logf"

    Runner->>Tool: execute test (capture stdout/stderr)
    Tool-->>Runner: JSON/text output
    Runner->>Parser: send output + testName
    Parser->>Thresholds: fetch soft/hard thresholds for testName
    Thresholds-->>Parser: return thresholds
    Parser-->>Runner: rtLatencyAnalysis (PASS/WARN/FAIL + metrics)
    Runner->>Artifact: write json/log artifacts (timestamped)
    Runner->>Logger: log summary, warnings, or errors
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 19.05%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly describes the main change: adding latency tuning (thresholds) for RT-kernel tests on AWS metal instances, which aligns with the PR's core objective of implementing per-test, per-instance-type latency thresholds. |
| Stable And Deterministic Test Names | ✅ Passed | The modified files contain only test infrastructure helper functions and standard Go unit tests with deterministic names, not Ginkgo test definitions with dynamic identifiers. |
| Test Structure And Quality | ✅ Passed | The new test file demonstrates high quality following Go conventions, with seven focused test functions, meaningful assertions with diagnostic messages, and clean helper functions. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@openshift-ci
Contributor

openshift-ci bot commented Feb 26, 2026

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/44998af0-130a-11f1-9082-40e95c4d00b8-0

@openshift-ci-robot

openshift-ci-robot commented Feb 26, 2026

@jeff-roche: This pull request references OCPEDGE-1036 which is a valid jira issue.

Details

In response to this:

  • 100 microsecond latency cap for metal instances
  • 7500 microsecond latency cap for non-metal instance types (previous default)

Summary by CodeRabbit

  • Tests

    • Enhanced real-time kernel test infrastructure with adaptive threshold configuration based on hardware type detection.
    • Improved test logging and artifact collection with timestamped output files for better result tracking and debugging.
    • Refactored test execution with better error handling and context reporting.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
test/extended/kernel/common.go (1)

64-68: Side effect in getter function modifies global state.

getRealTimeWorkerNodes modifies the global rtTestThresholds map, which is unexpected for a function with a "get" prefix. This couples threshold configuration to node discovery and makes the behavior harder to reason about.

Consider either:

  1. Renaming the function to reflect it configures thresholds (e.g., setupRealTimeWorkerNodes)
  2. Returning the metal status and handling threshold adjustment at the call site
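Option 2 could look roughly like the sketch below. The node type and label handling are hypothetical stand-ins; the real getRealTimeWorkerNodes in common.go uses the Kubernetes node API and an instance-type label check.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical stand-in for a Kubernetes worker node; only
// the instance-type label matters for this sketch.
type node struct {
	name         string
	instanceType string
}

// getRealTimeWorkerNodes returns the node names plus whether every node
// is a metal instance, leaving any threshold adjustment to the caller
// (so the "get" function has no side effects on global state).
func getRealTimeWorkerNodes(nodes []node) (names []string, allMetal bool) {
	allMetal = true
	for _, n := range nodes {
		names = append(names, n.name)
		if !strings.Contains(n.instanceType, "metal") {
			allMetal = false
		}
	}
	return names, allMetal
}

func main() {
	names, allMetal := getRealTimeWorkerNodes([]node{
		{name: "worker-0", instanceType: "m5.metal"},
		{name: "worker-1", instanceType: "m5.xlarge"},
	})
	// The caller, not the getter, decides whether to relax thresholds.
	fmt.Println(len(names), allMetal)
}
```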
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/kernel/common.go` around lines 64 - 68, getRealTimeWorkerNodes
currently mutates the global rtTestThresholds map (when nodesAreMetal is false),
which is a surprising side effect for a getter; stop modifying rtTestThresholds
inside getRealTimeWorkerNodes and instead either (A) rename
getRealTimeWorkerNodes to setupRealTimeWorkerNodes if you intend it to configure
thresholds, or (B) change getRealTimeWorkerNodes to only return the metal status
(bool nodesAreMetal) and move the rtTestThresholds adjustments out to the call
site so callers can set rtTestThresholds[test] = 7500 when nodesAreMetal is
false; update all callers of getRealTimeWorkerNodes accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/kernel/common.go`:
- Around line 57-68: The metal-detection currently uses node.GetLabels() and
sets nodesAreMetal = false if any worker node isn't metal, which incorrectly
flags clusters where non-RT workers are non-metal; update the logic so the metal
check is only performed for nodes that match the RT kernel condition (the same
condition used to select RT nodes) — i.e., inside the RT kernel match block
iterate those nodes, call node.GetLabels(), and only then modify nodesAreMetal
and adjust rtTestThresholds; reference variables/functions: node.GetLabels(),
nodesAreMetal, rtTestThresholds, and the RT kernel match condition so the
threshold padding runs only when RT nodes are detected as non-metal.
- Line 48: Replace the incorrect capacity argument on the nodes slice
allocation: the current call uses kubeNodes.Size() (which returns protobuf
serialized size) when constructing nodes via make([]string, 0, ...); change it
to use the number of items with len(kubeNodes.Items) so nodes = make([]string,
0, len(kubeNodes.Items)). Update the allocation site that references kubeNodes
and the nodes variable in test/extended/kernel/common.go (search for the
make([]string, 0, kubeNodes.Size()) occurrence).

In `@test/extended/kernel/tools.go`:
- Around line 165-167: The error message in runCyclictest incorrectly references
"oslat test"; update the returned fmt.Errorf string in the runCyclictest
function (where cpuCount is checked) to reference "cyclictest" (or
"runCyclictest") instead and preserve the numeric cpuCount interpolation and
wording; ensure only the test name in the message is changed so the check using
cpuCount and the fmt.Errorf call remain otherwise identical.


ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 3cec87c and 8de1caf.

📒 Files selected for processing (2)
  • test/extended/kernel/common.go
  • test/extended/kernel/tools.go

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@openshift-ci-robot

openshift-ci-robot commented Mar 2, 2026

@jeff-roche: This pull request references OCPEDGE-1036 which is a valid jira issue.

Details

In response to this:

  • 100 microsecond latency cap for metal instances
  • 7500 microsecond latency cap for non-metal instance types (previous default)

Summary by CodeRabbit

  • Tests

    • Enhanced real-time kernel test infrastructure with adaptive per-test thresholds when non-metal nodes are detected.
    • Improved test execution with unified per-test names, richer error context, and consistent output capture.
    • Added timestamped artifact logging for each test to simplify result tracking and debugging.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

openshift-ci-robot commented Mar 5, 2026

@jeff-roche: This pull request references OCPEDGE-1036 which is a valid jira issue.

Details

In response to this:

Summary

Replaces the binary pass/fail latency detection with a smarter three-tier analysis that distinguishes real RT kernel issues from environmental noise (e.g., isolated single-CPU spikes on AWS metal instances).

Changes

Two-tier soft/hard thresholds:

  • Soft threshold: expected max latency — exceeding triggers a warning but the test still passes
  • Hard threshold: absolute max latency — exceeding triggers a test failure

Statistical percentage-based detection:

  • If >5% of CPUs exceed the soft threshold, the test fails as a systemic latency issue
  • Isolated spikes on a single CPU (e.g., 1 out of 80) produce a warning, not a failure

Structured JSON diagnostic artifacts:

  • Each latency test now writes an _analysis.json artifact with: max, avg, P99 latency, per-CPU breakdown, soft/hard threshold counts, and overall result (PASS/WARN/FAIL)

Thresholds

| Test | Metal Soft | Metal Hard | Non-Metal Soft | Non-Metal Hard |
|---|---|---|---|---|
| oslat | 100us | 500us | 7500us | 10000us |
| cyclictest | 100us | 500us | 7500us | 10000us |
| hwlatdetect | 100us | 200us | 7500us | 10000us |
| deadline_test | 100us | 200us | 7500us | 10000us |

Code cleanup

  • Unified parseOslatResults and parseCyclictestResults into a single parseLatencyResults function
  • Added unit tests for the new parsing logic

Expected behavior with real job data

| Scenario | Old Behavior | New Behavior |
|---|---|---|
| 1/80 CPUs at 211us | FAIL | WARN (pass) |
| 1/91 CPUs at 241us | FAIL | WARN (pass) |
| 10/80 CPUs at 300us | FAIL | FAIL (systemic) |
| 1/80 CPUs at 600us | FAIL | FAIL (hard threshold) |
| All CPUs under 100us | PASS | PASS |

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jeff-roche
Contributor Author

/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

@openshift-ci
Contributor

openshift-ci bot commented Mar 5, 2026

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7e364250-18ad-11f1-9802-7c1b2e727428-0

@jeff-roche
Contributor Author

/test e2e-gcp-ovn-rt-upgrade

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@jeff-roche
Contributor Author

/retest

1 similar comment
@jeff-roche
Contributor Author

/retest

@jeff-roche
Contributor Author

/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

@openshift-ci
Contributor

openshift-ci bot commented Mar 6, 2026

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/09f90480-1962-11f1-9f39-e2de7efda1ab-0

Replace binary pass/fail latency detection with a three-tier analysis:
- Two-tier thresholds (soft/hard) to distinguish warnings from failures
- Statistical percentage-based detection (>5% CPUs over soft = systemic fail)
- Structured JSON diagnostic artifacts for richer test result analysis

Metal thresholds: oslat/cyclictest soft=100us hard=500us,
hwlatdetect/deadline_test soft=100us hard=200us.
Non-metal thresholds: soft=7500us hard=10000us.

Unifies parseOslatResults and parseCyclictestResults into a single
parseLatencyResults function with comprehensive statistics (max, avg,
P99, per-CPU breakdown). Adds unit tests for the new parsing logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jeff-roche
Contributor Author

/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

@openshift-ci
Contributor

openshift-ci bot commented Mar 9, 2026

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/dbcf1fa0-1c03-11f1-9335-cf4fd9a8554d-0

@openshift-ci-robot

openshift-ci-robot commented Mar 9, 2026

@jeff-roche: This pull request references OCPEDGE-1036 which is a valid jira issue.

Details

In response to this:

Summary

Replaces the binary pass/fail latency detection with a smarter three-tier analysis that distinguishes real RT kernel issues from environmental noise (e.g., isolated single-CPU spikes on AWS metal instances).

Changes

Two-tier soft/hard thresholds:

  • Soft threshold: expected max latency — exceeding triggers a warning but the test still passes
  • Hard threshold: absolute max latency — exceeding triggers a test failure

Statistical percentage-based detection:

  • If >5% of CPUs exceed the soft threshold, the test fails as a systemic latency issue
  • Isolated spikes on a single CPU (e.g., 1 out of 80) produce a warning, not a failure

Structured JSON diagnostic artifacts:

  • Each latency test now writes an _analysis.json artifact with: max, avg, P99 latency, per-CPU breakdown, soft/hard threshold counts, and overall result (PASS/WARN/FAIL)

Thresholds

Test Metal Soft Metal Hard Non-Metal Soft Non-Metal Hard
oslat 100us 500us 7500us 10000us
cyclictest 100us 500us 7500us 10000us
hwlatdetect 100us 200us 7500us 10000us
deadline_test 100us 200us 7500us 10000us

Code cleanup

  • Unified parseOslatResults and parseCyclictestResults into a single parseLatencyResults function
  • Added unit tests for the new parsing logic

Expected behavior with real job data

| Scenario | Old Behavior | New Behavior |
|---|---|---|
| 1/80 CPUs at 211us | FAIL | WARN (pass) |
| 1/91 CPUs at 241us | FAIL | WARN (pass) |
| 10/80 CPUs at 300us | FAIL | FAIL (systemic) |
| 1/80 CPUs at 600us | FAIL | FAIL (hard threshold) |
| All CPUs under 100us | PASS | PASS |

Summary by CodeRabbit

Release Notes

  • New Features

    • Per-test latency thresholding for real-time performance tests.
    • Enhanced latency analysis with additional metrics (max, average, P99, CPU overruns).
    • Structured per-test artifact logging with timestamps for better result tracking.
  • Improvements

    • Improved error reporting with contextual information for test failures.
  • Tests

    • Added comprehensive unit tests for latency result parsing and threshold handling.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
test/extended/kernel/tools_test.go (1)

33-187: Add regression coverage for the real policy boundary and default table.

Right now the suite only exercises custom thresholds plus 1% and 10% soft-threshold cases. A drift in the documented 5% cutoff, or in the package defaults for rtTestThresholds, would still pass. Please add cases for exactly 5% vs 6% and a small assertion on the default threshold map.
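A boundary case like the one suggested could be expressed as a small table-driven check. The classifier below is a self-contained stand-in (the real parseLatencyResults takes raw tool output, which is not shown in this thread), so the function name and signature are assumptions.

```go
package main

import "fmt"

// verdict is a stand-in for the PASS/WARN/FAIL decision so the
// percentage boundary can be exercised directly; the cutoff is
// strictly-greater-than, matching "If >N% of CPUs exceed..." above.
func verdict(totalCPUs, overSoft int, cutoffPct float64) string {
	if overSoft == 0 {
		return "PASS"
	}
	pct := 100 * float64(overSoft) / float64(totalCPUs)
	if pct > cutoffPct {
		return "FAIL"
	}
	return "WARN"
}

func main() {
	// Exactly 5% of 100 CPUs over soft is not >5%, so WARN; 6% flips to FAIL.
	for _, over := range []int{5, 6} {
		fmt.Printf("%d/100 over soft -> %s\n", over, verdict(100, over, 5))
	}
}
```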

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/kernel/tools_test.go` around lines 33 - 187, Add tests that
cover the 5% boundary and the next integer (6%) and assert the package default
threshold map; specifically add a test that builds a report where exactly 5% of
CPUs exceed SoftThreshold (e.g., for 100 CPUs set 5 CPUs > SoftThreshold) and
assert parseLatencyResults returns PASS (or WARN/Fail per policy) and another
where 6% exceed and assert the opposite result, using parseLatencyResults and
rtThresholdConfig to construct thresholds; also add a small test that asserts
rtTestThresholds (the default threshold map) contains the expected default
SoftThreshold/HardThreshold entries for the relevant tests to catch regressions
in defaults.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/kernel/tools.go`:
- Around line 91-120: runDeadlineTest and runHwlatdetect currently just pass a
single hard threshold and return on process exit, skipping the shared
PASS/WARN/FAIL analysis and not emitting an *_analysis.json; modify
runDeadlineTest and runHwlatdetect to feed the command output (res) and
thresholds (rtTestThresholds[testName]) into the same analysis routine used for
oslat/cyclictest (or implement a shared function analyzeTestOutput(output
string, thresholds Thresholds) that applies hard/soft thresholds and %‑over‑soft
rules), call writeTestArtifacts for both the raw log and the generated
<testName>_analysis.json, and return an error only when the analysis determines
FAIL (preserving existing error wrapping behavior using errors.Wrap with the
same messages).
- Around line 56-63: The systemic soft-threshold cutoff and per-test defaults
were loosened; revert them to the contract values by changing
maxSoftThresholdViolationPercent from 10.0 back to the contract value (e.g.,
5.0) and update rtTestThresholds so oslat and cyclictest use SoftThreshold: 100
(not 150) and all tests (deadline_test, oslat, cyclictest, hwlatdetect) use the
expected HardThreshold: 200 (not 500) so PASS/WARN/FAIL behavior matches the PR
contract.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bfd9f28e-5340-408c-82ed-35728aaf79fd

📥 Commits

Reviewing files that changed from the base of the PR and between 1059dd7 and ced9954.

📒 Files selected for processing (3)
  • test/extended/kernel/common.go
  • test/extended/kernel/tools.go
  • test/extended/kernel/tools_test.go

@openshift-ci-robot

openshift-ci-robot commented Mar 9, 2026

@jeff-roche: This pull request references OCPEDGE-1036 which is a valid jira issue.

Details

In response to this:

Summary

Replaces the binary pass/fail latency detection with a smarter three-tier analysis that distinguishes real RT kernel issues from environmental noise (e.g., isolated single-CPU spikes on AWS metal instances).

Changes

Two-tier soft/hard thresholds:

  • Soft threshold: expected max latency — exceeding triggers a warning but the test still passes
  • Hard threshold: absolute max latency — exceeding triggers a test failure

Statistical percentage-based detection:

  • If >10% of CPUs exceed the soft threshold, the test fails as a systemic latency issue
  • Isolated spikes on a single CPU (e.g., 1 out of 80) produce a warning, not a failure

Structured JSON diagnostic artifacts:

  • Each latency test now writes an _analysis.json artifact with: max, avg, P99 latency, per-CPU breakdown, soft/hard threshold counts, and overall result (PASS/WARN/FAIL)

Thresholds

| Test | Metal Soft | Metal Hard | Non-Metal Soft | Non-Metal Hard |
|---|---|---|---|---|
| oslat | 150us | 500us | 7500us | 10000us |
| cyclictest | 150us | 500us | 7500us | 10000us |
| hwlatdetect | 100us | 500us | 7500us | 10000us |
| deadline_test | 100us | 500us | 7500us | 10000us |

Code cleanup

  • Unified parseOslatResults and parseCyclictestResults into a single parseLatencyResults function
  • Added unit tests for the new parsing logic

Expected behavior with real job data

| Scenario | Old Behavior | New Behavior |
|---|---|---|
| 1/80 CPUs at 211us | FAIL | WARN (pass) |
| 1/91 CPUs at 241us | FAIL | WARN (pass) |
| 10/80 CPUs at 300us | FAIL | FAIL (systemic) |
| 1/80 CPUs at 600us | FAIL | FAIL (hard threshold) |
| All CPUs under 100us | PASS | PASS |

Summary by CodeRabbit

Release Notes

  • New Features

  • Per-test latency thresholding for real-time performance tests.

  • Enhanced latency analysis with additional metrics (max, average, P99, CPU overruns).

  • Structured per-test artifact logging with timestamps for better result tracking.

  • Improvements

  • Improved error reporting with contextual information for test failures.

  • Tests

  • Added comprehensive unit tests for latency result parsing and threshold handling.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@jeff-roche
Contributor Author

/retest

@jeff-roche
Contributor Author

/verified by payload job

@openshift-ci-robot added the verified label Mar 10, 2026
@openshift-ci-robot

@jeff-roche: This PR has been marked as verified by payload job.

Details

In response to this:

/verified by payload job

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@eggfoobar
Contributor

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-gcp-ovn-rt-rhcos10-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Mar 10, 2026

@eggfoobar: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-gcp-ovn-rt-rhcos10-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/3d688e50-1c80-11f1-8471-5c7c808c8675-0

@eggfoobar
Contributor

/lgtm

@openshift-ci bot added the lgtm label Mar 10, 2026
@openshift-ci
Contributor

openshift-ci bot commented Mar 10, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eggfoobar, jeff-roche, qJkee

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jeff-roche
Contributor Author

/retest

@openshift-ci
Contributor

openshift-ci bot commented Mar 11, 2026

@jeff-roche: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-aws-ovn-microshift | ced9954 | link | true | /test e2e-aws-ovn-microshift |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

  • approved — Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-reference — Indicates that this PR references a valid Jira ticket of any type.
  • lgtm — Indicates that a PR is ready to be merged.
  • verified — Signifies that the PR passed pre-merge verification criteria.

4 participants